AdvGLUE

The Adversarial GLUE Benchmark

Performance of RoBERTa (single model) on AdvGLUE

Overall Statistics

Scores are percentages (chart axis 0 to 100). A dash marks a task with no human-crafted examples.

Task      Metric     GLUE Dev   AdvGLUE Word   AdvGLUE Sentence   AdvGLUE Human   AdvGLUE Overall
SST-2     Accuracy   96.0       61.1           51.4               70.3            58.5
QQP       Accuracy   92.0       64.7           48.6               24.6            57.1
QQP       F1         89.4       51.5           9.5                14.0            41.8
QNLI      Accuracy   94.1       62.5           49.2               28.5            52.5
RTE       Accuracy   86.6       42.3           49.3               -               45.4
MNLI-m    Accuracy   89.7       55.8           43.1               -               50.8
MNLI-mm   Accuracy   89.9       58.0           36.5               22.5            39.6

Performance of RoBERTa (single model) on each task

The Stanford Sentiment Treebank (SST-2)

Level      Attack        Adversarial Acc
Word       Typo          56.2
Word       Knowledge     69.2
Word       Embedding     66.7
Word       Context       57.7
Word       Composition   56.3
Sentence   Syntactic     45.0
Sentence   Distraction   61.1
Human      CheckList     70.3

Quora Question Pairs (QQP)

Level      Attack        Adversarial Acc   Adversarial F1
Word       Typo          64.0              47.1
Word       Knowledge     82.4              66.7
Word       Embedding     69.0              38.1
Word       Context       60.0              22.2
Word       Composition   62.9              58.2
Sentence   Syntactic     48.6              9.5
Human      CheckList     24.6              14.0

MultiNLI (MNLI) matched

Level      Attack        Adversarial Acc
Word       Typo          59.3
Word       Knowledge     60.7
Word       Embedding     50.0
Word       Context       57.3
Word       Composition   54.1
Sentence   Syntactic     42.3
Sentence   Distraction   44.4

MultiNLI (MNLI) mismatched

Level      Attack        Adversarial Acc
Word       Typo          44.6
Word       Knowledge     67.2
Word       Embedding     70.4
Word       Context       65.5
Word       Composition   56.4
Sentence   Syntactic     31.1
Sentence   Distraction   45.9
Human      StressTest    16.6
Human      ANLI          28.4

Question NLI (QNLI)

Level      Attack        Adversarial Acc
Word       Typo          68.1
Word       Knowledge     59.2
Word       Embedding     53.4
Word       Context       59.2
Word       Composition   68.3
Sentence   Syntactic     39.3
Sentence   Distraction   62.9
Human      CheckList     35.1
Human      AdvSQuAD      23.8

Recognizing Textual Entailment (RTE)

Level      Attack        Adversarial Acc
Word       Typo          47.8
Word       Knowledge     35.5
Word       Embedding     48.8
Word       Context       29.6
Word       Composition   45.5
Sentence   Syntactic     46.6
Sentence   Distraction   54.2